{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Readability Exercise\n",
    "\n",
    "Welcome! Below you will implement two metrics for evaluating the readability of documents:\n",
    "\n",
    "1. Flesch–Kincaid readability Grade Index \n",
    "2. Gunning Fog Index\n",
    "\n",
    "The solutions are in [readability_solutions.py](./readability_solutions.py). You can also click the jupyter icon to see all the files in the folder.\n",
    "\n",
    "To load all the functions in the solutions, simply include `from solutions import *`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Initialization\n",
    "\n",
    "Let's read-in our text files. We have three different texts files to play with: \n",
    "\n",
    "1. `physics.txt`: taken from a technical wikipedia article about a theoretical physics idea called [Supersymmetry](https://en.wikipedia.org/wiki/Supersymmetry)\n",
    "\n",
    "2. `APPL_10k_2017.txt`: the 2017 10-K Item IA for APPLE INC, taken from the EDGAR website\n",
    "\n",
    "3. `alice.txt`: Excerpts from \"Alice in Wonderland\", the novel is in the public domain and freely available"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# download some excerpts from 10-K files\n",
    "\n",
    "from download10k import get_files\n",
    "\n",
    "CIK = {'ebay': '0001065088', 'apple':'0000320193', 'sears': '0001310067'}\n",
    "get_files(CIK['ebay'], 'EBAY')\n",
    "get_files(CIK['apple'], 'AAPL')\n",
    "get_files(CIK['sears'], 'SHLDQ')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sentences separated by ; are better viewed as multiple sentences\n",
    "# join combines all the newlines in the file\n",
    "\n",
    "f = open(\"physics.txt\", \"r\")\n",
    "text_phy=''.join(f).replace(';','.')\n",
    "\n",
    "f = open(\"alice.txt\", \"r\")\n",
    "text_alice=''.join(f).replace(';','.')\n",
    "\n",
    "f = open(\"AAPL_10k_2017.txt\", \"r\")\n",
    "text_10k=''.join(f).replace(';','.')\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check out some of the texts\n",
    "print(text_10k[:500]+\"...\\n\")\n",
    "print(text_phy[:500]+\"...\\n\")\n",
    "print(text_alice[:500]+\"...\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Pre-processing\n",
    "Here, we need to define functions that can split our texts into sentences, and split our sentences into words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# tokenize and clean the text\n",
    "import nltk\n",
    "from nltk.stem import WordNetLemmatizer, SnowballStemmer\n",
    "from collections import Counter\n",
    "from nltk.corpus import stopwords\n",
    "\n",
    "from nltk import word_tokenize\n",
    "from syllable_count import syllable_count\n",
    "\n",
    "\n",
    "nltk.download('wordnet')\n",
    "\n",
    "sno = SnowballStemmer('english')\n",
    "wnl = WordNetLemmatizer()\n",
    "\n",
    "from nltk.tokenize import RegexpTokenizer\n",
    "\n",
    "# tokenizer that selects out non letter and non symbol (i.e. all alphabets)\n",
    "word_tokenizer = RegexpTokenizer(r'[^\\d\\W]+')\n",
    "\n",
    "\n",
    "def word_tokenize(sent):\n",
    "    return [ w for w in word_tokenizer.tokenize(sent) if w.isalpha() ]\n",
    "\n",
    "# for the sentence tokenizer\n",
    "nltk.download('punkt')\n",
    "from nltk.tokenize import sent_tokenize\n",
    "\n",
    "# you can tokenize sentences by calling\n",
    "# sent_tokenize(document)\n",
    "\n",
    "# and tokenize words by calling\n",
    "# word_tokenize(sentence)\n",
    "\n",
    "# syllable_count counts the number of syllables in a word\n",
    "# it's included in syllable_count.py"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now implement three functions\n",
    "\n",
    "1. `sentence_count`: a simple function that returns the number of sentences in a document\n",
    "\n",
    "2. `word_count`: a simple function that returns the number of words in a sentence\n",
    "\n",
    "3. `hard_word_count`: a simple function that returns the number of words with more than 3 syllables, while removing suffixes like \"-ed\", and \"-ing\". This can be done by lemmatizing the word, i.e. `wnl.lemmatize(word, pos='v')`\n",
    "\n",
    "the function `word_tokenize(sentence)` will be useful here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def sentence_count(text):\n",
    "    # TO DO\n",
    "    pass\n",
    "\n",
    "def word_count(sent):\n",
    "    # TO DO\n",
    "    pass\n",
    "\n",
    "def hard_word_count(sent):\n",
    "    # TO DO\n",
    "    pass\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Readability Grade-Levels\n",
    "\n",
    "Here, you will implement the two readability indices (grade levels). They are defined by\n",
    "\n",
    "\\begin{align}\n",
    "\\textrm{Flesch–Kincaid Grade} \n",
    "= 0.39 \\left(\n",
    "\\frac{\\textrm{Number of words}}{\\textrm{Number of sentences}}\n",
    "\\right) \\\\\n",
    "+11.8\n",
    "\\left(\n",
    "\\frac{\\textrm{Number of hard words}}{\\textrm{Number of words}}\n",
    "\\right)\n",
    "-15.59\n",
    "\\end{align}\n",
    "\n",
    "and\n",
    "\n",
    "\\begin{align}\n",
    "\\textrm{Gunning-Fog Grade} \n",
    "=\\; &0.4 \\bigg[ \n",
    "\\left(\n",
    "\\frac{\\textrm{Number of words}}{\\textrm{Number of sentences}}\n",
    "\\right) \n",
    "+100\n",
    "\\left(\n",
    "\\frac{\\textrm{Number of syllables}}{\\textrm{Number of words}}\n",
    "\\right)\n",
    "\\bigg]\n",
    "\\end{align}\n",
    "\n",
    "To count syllables, we've added a syllable_count function you can access via \n",
    "\n",
    "```\n",
    "from syllable_count import syllable_count\n",
    "syllable_count(\"syllable\")\n",
    "```\n",
    "\n",
    "Below, implement the function `flesch_index` and `fog_index` that computes the readability grade level for a given text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def flesch_index(text):\n",
    "    # TO DO\n",
    "    pass\n",
    "\n",
    "def fog_index(text):\n",
    "    # TO DO\n",
    "    pass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3 Results\n",
    "\n",
    "Now that you've coded up the exercises, compute the grade levels for each of the texts given.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# to test the solutions\n",
    "# uncommon next line\n",
    "# from readability_solutions import *\n",
    "\n",
    "print(flesch_index(text_alice),fog_index(text_alice))\n",
    "print(flesch_index(text_phy),fog_index(text_phy))\n",
    "print(flesch_index(text_10k),fog_index(text_10k))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should expect a grade level around 7-10 for `alice.txt`, and around 16-19 for `physics.txt`, and 18+ for financial documents! \n",
    "\n",
    "It turns out 10-Ks are really *hard* to read legal documents!\n",
    "Now, let's compute the readability for all the 10-Ks we have"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filelist_10k=!ls *10k*txt\n",
    "\n",
    "\n",
    "flesch = []\n",
    "fog = []\n",
    "\n",
    "for file in filelist_10k:\n",
    "    with open(file, 'r') as f:\n",
    "        text=''.join(f).replace(';','.')\n",
    "        flesch.append(flesch_index(text))\n",
    "        fog.append(fog_index(text))\n",
    "        print(file, flesch[-1],fog[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Superficially, and according to our readability metrics, reading 10-Ks is harder than reading articles on theoretical physics!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Bonus exercise:\n",
    "How are the two readability grade-levels correlated? Compute the covariance matrix of the two readability indices we have on all the 10K documents, and make a scatter plot of Flesch index vs Fog index. Also perform a least-squared fit to the result and plot it as well.\n",
    "\n",
    "(change bottom cell to code and remove html tags for solution)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy.stats import linregress\n",
    "\n",
    "\n",
    "# TO DO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color=\"white\">\n",
    "\n",
    "#solution\n",
    "cov = np.cov(flesch,fog)\n",
    "print(cov)\n",
    "\n",
    "%matplotlib inline\n",
    "\n",
    "plt.figure(figsize=(6,5))\n",
    "plt.scatter(flesch,fog) \n",
    "\n",
    "slope, intercept, r_value, p_value, std_err = linregress(flesch, fog)\n",
    "\n",
    "x=np.linspace(16.5,19,101)\n",
    "y=slope*x+intercept\n",
    "plt.plot(x,y)\n",
    "\n",
    "plt.xlabel(\"Flesch Index\")\n",
    "plt.ylabel(\"Fog Index\")\n",
    "\n",
    "</font>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
